Load Libraries¶
We start by importing the core libraries for data analysis and visualization. Pandas will be used for handling data, NumPy for numerical computation, and Matplotlib/Seaborn for plotting. We also import statsmodels for regression modeling and scikit-learn utilities for model evaluation and preprocessing.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.anova as sms
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
# seaborn is a data visualization library built on matplotlib
import seaborn as sns
# set the plotting style
sns.set_style("whitegrid")
Load the data¶
Load the EdGap data set
edgap = pd.read_excel(
'../data/EdGap_data.xlsx',
dtype={'NCESSCH School ID': object})
openpyxl UserWarning: Unknown extension is not supported and will be removed
school_information = pd.read_csv(
'../data/ccd_sch_029_1617_w_1a_11212017.csv', encoding='unicode_escape'
)
DtypeWarning: Columns (6,9,14,15,18,19,21,22,25,26,29,31,35,39,41,42) have mixed types. Specify dtype option on import or set low_memory=False.
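The DtypeWarning appears because pandas infers column types chunk by chunk on large files, so a column whose values change type partway through can come out mixed. A minimal sketch with a hypothetical toy CSV (not the CCD file) showing how an explicit dtype resolves the ambiguity:

```python
import io

import pandas as pd

# Hypothetical toy CSV: the school_id column mixes numeric and string values.
toy_csv = "school_id,grade\n100,9\n101,10\nABC,11\n"

# An explicit dtype keeps the column consistent (as done for the EdGap id above).
df_toy = pd.read_csv(io.StringIO(toy_csv), dtype={"school_id": str})
print(df_toy["school_id"].tolist())  # ['100', '101', 'ABC']
```

Passing low_memory=False instead would also silence the warning by reading the whole file in one pass, at the cost of more memory.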
Explore the contents of the data sets¶
Start by looking at the head of each data frame¶
This will let us see the names of the columns and a few example values for each column. We can also check whether the data is in tidy format.
edgap.head()
| | NCESSCH School ID | CT Unemployment Rate | CT Pct Adults with College Degree | CT Pct Childre In Married Couple Family | CT Median Household Income | School ACT average (or equivalent if SAT score) | School Pct Free and Reduced Lunch |
|---|---|---|---|---|---|---|---|
| 0 | 100001600143 | 0.117962 | 0.445283 | 0.346495 | 42820.0 | 20.433455 | 0.066901 |
| 1 | 100008000024 | 0.063984 | 0.662765 | 0.767619 | 89320.0 | 19.498168 | 0.112412 |
| 2 | 100008000225 | 0.056460 | 0.701864 | 0.713090 | 84140.0 | 19.554335 | 0.096816 |
| 3 | 100017000029 | 0.044739 | 0.692062 | 0.641283 | 56500.0 | 17.737485 | 0.296960 |
| 4 | 100018000040 | 0.077014 | 0.640060 | 0.834402 | 54015.0 | 18.245421 | 0.262641 |
pd.set_option('display.max_columns', None)
school_information.head()
| | SCHOOL_YEAR | FIPST | STATENAME | ST | SCH_NAME | LEA_NAME | STATE_AGENCY_NO | UNION | ST_LEAID | LEAID | ST_SCHID | NCESSCH | SCHID | MSTREET1 | MSTREET2 | MSTREET3 | MCITY | MSTATE | MZIP | MZIP4 | LSTREET1 | LSTREET2 | LSTREET3 | LCITY | LSTATE | LZIP | LZIP4 | PHONE | WEBSITE | SY_STATUS | SY_STATUS_TEXT | UPDATED_STATUS | UPDATED_STATUS_TEXT | EFFECTIVE_DATE | SCH_TYPE_TEXT | SCH_TYPE | RECON_STATUS | OUT_OF_STATE_FLAG | CHARTER_TEXT | CHARTAUTH1 | CHARTAUTHN1 | CHARTAUTH2 | CHARTAUTHN2 | NOGRADES | G_PK_OFFERED | G_KG_OFFERED | G_1_OFFERED | G_2_OFFERED | G_3_OFFERED | G_4_OFFERED | G_5_OFFERED | G_6_OFFERED | G_7_OFFERED | G_8_OFFERED | G_9_OFFERED | G_10_OFFERED | G_11_OFFERED | G_12_OFFERED | G_13_OFFERED | G_UG_OFFERED | G_AE_OFFERED | GSLO | GSHI | LEVEL | IGOFFERED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-2017 | 1 | ALABAMA | AL | Sequoyah Sch - Chalkville Campus | Alabama Youth Services | 1 | NaN | AL-210 | 100002 | AL-210-0020 | 1.000020e+10 | 100277.0 | P O Box 9486 | NaN | NaN | Birmingham | AL | 35220 | NaN | 1000 Industrial School Road | NaN | NaN | Birmingham | AL | 35220 | NaN | (205)680-8574 | http://www.dys.alabama.gov | 1 | Open | 1 | Open | 03/03/2010 | Alternative School | 4 | No | No | No | NaN | NaN | NaN | NaN | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | 07 | 12 | High | As reported |
| 1 | 2016-2017 | 1 | ALABAMA | AL | Camps | Alabama Youth Services | 1 | NaN | AL-210 | 100002 | AL-210-0050 | 1.000020e+10 | 101667.0 | P O Box 66 | NaN | NaN | Mt Meigs | AL | 36057 | NaN | 1601 County Rd. 57 | NaN | NaN | Prattville | AL | 36067 | NaN | (334)215-3850 | http://www.dys.alabama.gov | 1 | Open | 1 | Open | 03/03/2010 | Alternative School | 4 | No | No | No | NaN | NaN | NaN | NaN | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | 07 | 12 | High | As reported |
| 2 | 2016-2017 | 1 | ALABAMA | AL | Det Ctr | Alabama Youth Services | 1 | NaN | AL-210 | 100002 | AL-210-0060 | 1.000020e+10 | 101670.0 | P O Box 66 | NaN | NaN | Mt Meigs | AL | 36057 | NaN | 2109 Bashi Rd Bldg 509 | NaN | NaN | Thomasville | AL | 36784 | NaN | (334)215-3850 | http://www.dys.alabama.gov | 1 | Open | 1 | Open | 03/03/2010 | Alternative School | 4 | No | No | No | NaN | NaN | NaN | NaN | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | 07 | 12 | High | As reported |
| 3 | 2016-2017 | 1 | ALABAMA | AL | Wallace Sch - Mt Meigs Campus | Alabama Youth Services | 1 | NaN | AL-210 | 100002 | AL-210-0030 | 1.000020e+10 | 101705.0 | P O Box 66 | NaN | NaN | Mount Meigs | AL | 36057 | NaN | 1000 Industrial School Road | NaN | NaN | Mount Meigs | AL | 36057 | NaN | (334)215-6039 | http://www.dys.alabama.gov | 1 | Open | 1 | Open | 03/03/2010 | Alternative School | 4 | No | No | No | NaN | NaN | NaN | NaN | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | 07 | 12 | High | As reported |
| 4 | 2016-2017 | 1 | ALABAMA | AL | McNeel Sch - Vacca Campus | Alabama Youth Services | 1 | NaN | AL-210 | 100002 | AL-210-0040 | 1.000020e+10 | 101706.0 | 8950 Roebuck Blvd | NaN | NaN | Birmingham | AL | 35206 | NaN | 8950 Roebuck Blvd | NaN | NaN | Birmingham | AL | 35206 | NaN | (205)838-4981 | http://www.dys.alabama.gov | 1 | Open | 1 | Open | 03/03/2010 | Alternative School | 4 | No | No | No | NaN | NaN | NaN | NaN | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | 07 | 12 | High | As reported |
Use the info method to check the data types, size of the data frame, and numbers of missing values
edgap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7986 entries, 0 to 7985
Data columns (total 7 columns):
 #   Column                                           Non-Null Count  Dtype
---  ------                                           --------------  -----
 0   NCESSCH School ID                                7986 non-null   object
 1   CT Unemployment Rate                             7972 non-null   float64
 2   CT Pct Adults with College Degree                7973 non-null   float64
 3   CT Pct Childre In Married Couple Family          7961 non-null   float64
 4   CT Median Household Income                       7966 non-null   float64
 5   School ACT average (or equivalent if SAT score)  7986 non-null   float64
 6   School Pct Free and Reduced Lunch                7986 non-null   float64
dtypes: float64(6), object(1)
memory usage: 436.9+ KB
school_information.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102183 entries, 0 to 102182
Data columns (total 65 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   SCHOOL_YEAR          102183 non-null  object
 1   FIPST                102183 non-null  int64
 2   STATENAME            102183 non-null  object
 3   ST                   102183 non-null  object
 4   SCH_NAME             102183 non-null  object
 5   LEA_NAME             102183 non-null  object
 6   STATE_AGENCY_NO      102183 non-null  object
 7   UNION                2533 non-null    float64
 8   ST_LEAID             102183 non-null  object
 9   LEAID                102183 non-null  object
 10  ST_SCHID             102183 non-null  object
 11  NCESSCH              102181 non-null  float64
 12  SCHID                102181 non-null  float64
 13  MSTREET1             102181 non-null  object
 14  MSTREET2             1825 non-null    object
 15  MSTREET3             28 non-null      object
 16  MCITY                102183 non-null  object
 17  MSTATE               102183 non-null  object
 18  MZIP                 102183 non-null  object
 19  MZIP4                60419 non-null   object
 20  LSTREET1             102183 non-null  object
 21  LSTREET2             582 non-null     object
 22  LSTREET3             13 non-null      object
 23  LCITY                102183 non-null  object
 24  LSTATE               102183 non-null  object
 25  LZIP                 102183 non-null  object
 26  LZIP4                59145 non-null   object
 27  PHONE                102183 non-null  object
 28  WEBSITE              51983 non-null   object
 29  SY_STATUS            102181 non-null  object
 30  SY_STATUS_TEXT       102181 non-null  object
 31  UPDATED_STATUS       102181 non-null  object
 32  UPDATED_STATUS_TEXT  102181 non-null  object
 33  EFFECTIVE_DATE       102181 non-null  object
 34  SCH_TYPE_TEXT        102181 non-null  object
 35  SCH_TYPE             102181 non-null  object
 36  RECON_STATUS         102181 non-null  object
 37  OUT_OF_STATE_FLAG    102179 non-null  object
 38  CHARTER_TEXT         102179 non-null  object
 39  CHARTAUTH1           6687 non-null    object
 40  CHARTAUTHN1          6687 non-null    object
 41  CHARTAUTH2           116 non-null     object
 42  CHARTAUTHN2          116 non-null     object
 43  NOGRADES             102179 non-null  object
 44  G_PK_OFFERED         102179 non-null  object
 45  G_KG_OFFERED         102179 non-null  object
 46  G_1_OFFERED          102179 non-null  object
 47  G_2_OFFERED          102179 non-null  object
 48  G_3_OFFERED          102179 non-null  object
 49  G_4_OFFERED          102179 non-null  object
 50  G_5_OFFERED          102179 non-null  object
 51  G_6_OFFERED          102179 non-null  object
 52  G_7_OFFERED          102179 non-null  object
 53  G_8_OFFERED          102179 non-null  object
 54  G_9_OFFERED          102179 non-null  object
 55  G_10_OFFERED         102179 non-null  object
 56  G_11_OFFERED         102179 non-null  object
 57  G_12_OFFERED         102179 non-null  object
 58  G_13_OFFERED         102179 non-null  object
 59  G_UG_OFFERED         102179 non-null  object
 60  G_AE_OFFERED         102179 non-null  object
 61  GSLO                 102179 non-null  object
 62  GSHI                 102179 non-null  object
 63  LEVEL                102179 non-null  object
 64  IGOFFERED            102179 non-null  object
dtypes: float64(3), int64(1), object(61)
memory usage: 50.7+ MB
Based on this information, we can make some overall observations:
- The school information data set is much larger than the EdGap data set. Clearly the EdGap data set does not include all schools.
- There are missing values in the EdGap data set.
- Each data set is in a tidy, or long, format.
- The data types for the variables of interest look correct, but the school identifier is an object in the EdGap data set and a float64 in the school information data set.
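The identifier dtype mismatch matters because pandas will not match keys of different types when joining. A sketch with hypothetical ids of unifying the key types before a merge:

```python
import pandas as pd

# Hypothetical ids: strings in one frame, floats in the other.
left = pd.DataFrame({"id": ["100001600143", "100008000024"], "act": [20.4, 19.5]})
right = pd.DataFrame({"id": [100001600143.0, 999.0], "state": ["DE", "XX"]})

# Cast the float key to an integer string so both keys share one representation.
right["id"] = right["id"].astype("int64").astype(str)

merged = left.merge(right, how="left", on="id")
print(merged["state"].tolist())  # only the shared id finds a match
```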
Next we want to perform quick exploratory data analysis to determine whether the data are sufficient to answer our question. If the data are not sufficient, we do not want to waste time doing anything that will not be productive.
Make a pair plot with regression lines to explore relationships between the variables, and format the plot
fig = sns.pairplot(
    edgap.drop(columns="NCESSCH School ID"),
    kind="reg",
    plot_kws={
        "line_kws": {"color": "blue"},
        "scatter_kws": {"alpha": 0.5, "color": "k", "s": 7},
    }
)
for ax in fig.axes.flat:
    if ax.get_xlabel() == 'CT Median Household Income':
        ax.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))  # scientific notation for income
    ax.set_xlabel(ax.get_xlabel(), fontsize=8, rotation=30, ha='right')  # x-axis label size and rotation
    ax.set_ylabel(ax.get_ylabel(), fontsize=8)  # y-axis label size
    # Rotate x-axis tick labels
    plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
plt.show()
Plot a single row
fig = sns.pairplot(
    edgap.drop(columns="NCESSCH School ID"),
    y_vars=['School ACT average (or equivalent if SAT score)'],
    kind="reg",
    plot_kws={
        "line_kws": {"color": "blue"},
        "scatter_kws": {"alpha": 0.5, "color": "k", "s": 7},
    }
)
for ax in fig.axes.flat:
    if ax.get_xlabel() == 'CT Median Household Income':
        ax.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))  # scientific notation for income
    ax.set_xlabel(ax.get_xlabel(), fontsize=8, rotation=30, ha='right')  # x-axis label size and rotation
    ax.set_ylabel(ax.get_ylabel(), fontsize=8)  # y-axis label size
    # Rotate x-axis tick labels
    plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
plt.show()
From the steps above, we know that:
- There appears to be a relationship between the socioeconomic variables and the ACT score
- There are some out-of-range ACT and percent lunch values that will need to be dealt with
- We should have confidence that it is worthwhile to spend time preparing the data for analysis.
Data Preparation¶
Select relevant subsets of the data¶
The school information data set contains many columns. We only need the year, school identity, location, school type, and charter status information.
Keep the columns SCHOOL_YEAR, NCESSCH, LSTATE, LZIP, SCH_TYPE_TEXT, LEVEL, and CHARTER_TEXT
school_information = school_information[
['SCHOOL_YEAR', 'NCESSCH', 'LSTATE', 'LZIP', 'SCH_TYPE_TEXT', 'LEVEL', 'CHARTER_TEXT']
]
school_information.head()
| | SCHOOL_YEAR | NCESSCH | LSTATE | LZIP | SCH_TYPE_TEXT | LEVEL | CHARTER_TEXT |
|---|---|---|---|---|---|---|---|
| 0 | 2016-2017 | 1.000020e+10 | AL | 35220 | Alternative School | High | No |
| 1 | 2016-2017 | 1.000020e+10 | AL | 36067 | Alternative School | High | No |
| 2 | 2016-2017 | 1.000020e+10 | AL | 36784 | Alternative School | High | No |
| 3 | 2016-2017 | 1.000020e+10 | AL | 36057 | Alternative School | High | No |
| 4 | 2016-2017 | 1.000020e+10 | AL | 35206 | Alternative School | High | No |
Rename columns¶
We will rename the columns to follow best practices of style and being informative. We will do it before joining data sets to make it obvious that the key has the same name in each data set.
edgap = edgap.rename(
columns={
"NCESSCH School ID": "id",
"CT Unemployment Rate": "rate_unemployment",
"CT Pct Adults with College Degree": "percent_college",
"CT Pct Childre In Married Couple Family": "percent_married",
"CT Median Household Income": "median_income",
"School ACT average (or equivalent if SAT score)": "average_act",
"School Pct Free and Reduced Lunch": "percent_lunch",
}
)
Rename the columns SCHOOL_YEAR, NCESSCH, LSTATE, LZIP, SCH_TYPE_TEXT, LEVEL, and CHARTER_TEXT to year, id, state, zip_code, school_type, school_level, and charter
school_information = school_information.rename(
columns={
'SCHOOL_YEAR': 'year',
'NCESSCH': 'id',
'LSTATE': 'state',
'LZIP': 'zip_code',
'SCH_TYPE_TEXT': 'school_type',
'LEVEL': 'school_level',
'CHARTER_TEXT': 'charter'
}
)
edgap.head()
| | id | rate_unemployment | percent_college | percent_married | median_income | average_act | percent_lunch |
|---|---|---|---|---|---|---|---|
| 0 | 100001600143 | 0.117962 | 0.445283 | 0.346495 | 42820.0 | 20.433455 | 0.066901 |
| 1 | 100008000024 | 0.063984 | 0.662765 | 0.767619 | 89320.0 | 19.498168 | 0.112412 |
| 2 | 100008000225 | 0.056460 | 0.701864 | 0.713090 | 84140.0 | 19.554335 | 0.096816 |
| 3 | 100017000029 | 0.044739 | 0.692062 | 0.641283 | 56500.0 | 17.737485 | 0.296960 |
| 4 | 100018000040 | 0.077014 | 0.640060 | 0.834402 | 54015.0 | 18.245421 | 0.262641 |
school_information.head()
| | year | id | state | zip_code | school_type | school_level | charter |
|---|---|---|---|---|---|---|---|
| 0 | 2016-2017 | 1.000020e+10 | AL | 35220 | Alternative School | High | No |
| 1 | 2016-2017 | 1.000020e+10 | AL | 36067 | Alternative School | High | No |
| 2 | 2016-2017 | 1.000020e+10 | AL | 36784 | Alternative School | High | No |
| 3 | 2016-2017 | 1.000020e+10 | AL | 36057 | Alternative School | High | No |
| 4 | 2016-2017 | 1.000020e+10 | AL | 35206 | Alternative School | High | No |
From the steps above:
- We selected a subset of columns of the school information data set.
- We renamed the columns for clarity and to follow formatting guidelines.
- We are ready to join the data frames.
Join data frames¶
We want to join the DataFrames using the identity of the school as the key. The identity is given by the NCESSCH school identifier.
The identifier is a float64 in the school information data set and an object in the EdGap data set.
We will cast the id column in the school information DataFrame to object so that the key has the same data type in both data sets.
school_information['id'] = school_information['id'].astype('object')
school_information.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102183 entries, 0 to 102182
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   year          102183 non-null  object
 1   id            102181 non-null  object
 2   state         102183 non-null  object
 3   zip_code      102183 non-null  object
 4   school_type   102181 non-null  object
 5   school_level  102179 non-null  object
 6   charter       102179 non-null  object
dtypes: object(7)
memory usage: 5.5+ MB
Join the data frames and call the result df
df = edgap.merge(
school_information,
how='left',
on='id'
)
df.head()
| | id | rate_unemployment | percent_college | percent_married | median_income | average_act | percent_lunch | year | state | zip_code | school_type | school_level | charter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001600143 | 0.117962 | 0.445283 | 0.346495 | 42820.0 | 20.433455 | 0.066901 | 2016-2017 | DE | 19804 | Regular School | High | Yes |
| 1 | 100008000024 | 0.063984 | 0.662765 | 0.767619 | 89320.0 | 19.498168 | 0.112412 | 2016-2017 | DE | 19709 | Regular School | High | No |
| 2 | 100008000225 | 0.056460 | 0.701864 | 0.713090 | 84140.0 | 19.554335 | 0.096816 | 2016-2017 | DE | 19709 | Regular School | High | No |
| 3 | 100017000029 | 0.044739 | 0.692062 | 0.641283 | 56500.0 | 17.737485 | 0.296960 | 2016-2017 | DE | 19958 | Regular School | High | No |
| 4 | 100018000040 | 0.077014 | 0.640060 | 0.834402 | 54015.0 | 18.245421 | 0.262641 | 2016-2017 | DE | 19934 | Regular School | High | No |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7986 entries, 0 to 7985
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 7986 non-null   object
 1   rate_unemployment  7972 non-null   float64
 2   percent_college    7973 non-null   float64
 3   percent_married    7961 non-null   float64
 4   median_income      7966 non-null   float64
 5   average_act        7986 non-null   float64
 6   percent_lunch      7986 non-null   float64
 7   year               7898 non-null   object
 8   state              7898 non-null   object
 9   zip_code           7898 non-null   object
 10  school_type        7898 non-null   object
 11  school_level       7898 non-null   object
 12  charter            7898 non-null   object
dtypes: float64(6), object(7)
memory usage: 811.2+ KB
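The counts above show 7986 EdGap rows but only 7898 with school information, so 88 ids found no match in the left join. The merge indicator parameter makes such audits explicit; a sketch with hypothetical ids:

```python
import pandas as pd

left = pd.DataFrame({"id": ["a", "b", "c"]})
right = pd.DataFrame({"id": ["a", "c"], "state": ["DE", "AL"]})

# indicator=True adds a _merge column recording the source of each row.
audit = left.merge(right, how="left", on="id", indicator=True)
unmatched = int((audit["_merge"] == "left_only").sum())
print(unmatched)  # 1 -- id 'b' has no school information
```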
We now have a merged data frame that contains all the information needed for the analysis.
Quality Control¶
There are natural bounds for the numerical variables in the data set. Check the minimum and maximum values in each column.
df.describe()
| | rate_unemployment | percent_college | percent_married | median_income | average_act | percent_lunch |
|---|---|---|---|---|---|---|
| count | 7972.000000 | 7973.000000 | 7961.000000 | 7966.000000 | 7986.000000 | 7986.000000 |
| mean | 0.098730 | 0.568930 | 0.633440 | 52026.905222 | 20.181532 | 0.420651 |
| std | 0.058959 | 0.165704 | 0.196764 | 24228.057079 | 2.595201 | 0.239754 |
| min | 0.000000 | 0.091493 | 0.000000 | 3589.000000 | -3.070818 | -0.054545 |
| 25% | 0.058655 | 0.450828 | 0.523810 | 36597.250000 | 18.600000 | 0.238501 |
| 50% | 0.085649 | 0.554979 | 0.667594 | 46833.500000 | 20.400000 | 0.381570 |
| 75% | 0.123376 | 0.676571 | 0.777135 | 61369.250000 | 21.910867 | 0.575447 |
| max | 0.590278 | 1.000000 | 1.000000 | 226181.000000 | 32.362637 | 0.998729 |
From the summary above, we can see that average_act and percent_lunch contain incorrect values: each has a negative minimum, which is out of range for these variables. Next, we set the out-of-range values to NaN using NumPy.
df.loc[df['average_act'] < 1, 'average_act'] = np.nan
df.loc[df['percent_lunch'] < 0, 'percent_lunch'] = np.nan
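The same masking pattern on a toy frame with hypothetical values, for reference:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "average_act": [20.4, -3.1, 18.2],     # ACT scores range from 1 to 36
    "percent_lunch": [0.07, 0.11, -0.05],  # proportions range from 0 to 1
})

scores.loc[scores["average_act"] < 1, "average_act"] = np.nan
scores.loc[scores["percent_lunch"] < 0, "percent_lunch"] = np.nan
print(scores.isna().sum().tolist())  # [1, 1] -- one invalid value per column
```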
We do not want to remove entire rows with incorrect values, because the other columns still contain useful information; the invalid entries are now NaN. Next we check the school type and school level.
df['school_type'].value_counts()
school_type
Regular School                 7885
Alternative School               10
Special Education School          2
Career and Technical School       1
Name: count, dtype: int64
df['school_level'].value_counts()
school_level
High            7230
Other            631
Not reported      35
Elementary         2
Name: count, dtype: int64
df['charter'].value_counts()
charter
No                7329
Yes                352
Not applicable     217
Name: count, dtype: int64
Since the ACT is taken by high school students, we keep only the high schools for the analysis
df = df.loc[df['school_level'] == 'High']
Next, we check for duplicated rows, which could distort the analysis
df.duplicated().sum()
np.int64(0)
Identify missing values¶
Let's check how many values of each variable are missing
df.isna().sum().to_frame(name="Number of Missing Values")
| | Number of Missing Values |
|---|---|
| id | 0 |
| rate_unemployment | 12 |
| percent_college | 11 |
| percent_married | 20 |
| median_income | 16 |
| average_act | 3 |
| percent_lunch | 20 |
| year | 0 |
| state | 0 |
| zip_code | 0 |
| school_type | 0 |
| school_level | 0 |
| charter | 0 |
From the frame above, we can see that the missing values come only from the EdGap data set; the school information columns are complete. Let's convert the counts above to the percentage of missing values for each variable
percent_missing = df.isna().mean().round(4) * 100
percent_missing.to_frame(name="Percent Missing Values")
| | Percent Missing Values |
|---|---|
| id | 0.00 |
| rate_unemployment | 0.17 |
| percent_college | 0.15 |
| percent_married | 0.28 |
| median_income | 0.22 |
| average_act | 0.04 |
| percent_lunch | 0.28 |
| year | 0.00 |
| state | 0.00 |
| zip_code | 0.00 |
| school_type | 0.00 |
| school_level | 0.00 |
| charter | 0.00 |
From the frame above, we can see that the percentages of missing values are very low. Let's also check how many states are represented in the data
df['state'].nunique()
20
So the data were collected from only 20 states; this kind of omission is not evident from examining NaN values in the data set. Since we focus on the ACT score and its percentage of missing values is very low, we drop the rows where the ACT score is missing
df = df.dropna(subset=['average_act'])
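Note that dropna(subset=...) removes only rows missing the listed column, leaving NaNs elsewhere untouched; a sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "average_act": [20.4, np.nan, 18.2],
    "median_income": [42820.0, 89320.0, np.nan],
})

# Drop only the row missing average_act; the NaN in median_income survives.
toy = toy.dropna(subset=["average_act"])
print(len(toy), int(toy["median_income"].isna().sum()))  # 2 1
```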
df.isna().sum().to_frame(name="Number of Missing Values")
| | Number of Missing Values |
|---|---|
| id | 0 |
| rate_unemployment | 12 |
| percent_college | 11 |
| percent_married | 20 |
| median_income | 16 |
| average_act | 0 |
| percent_lunch | 20 |
| year | 0 |
| state | 0 |
| zip_code | 0 |
| school_type | 0 |
| school_level | 0 |
| charter | 0 |
All remaining rows now have a non-null ACT score. If we also dropped rows that have NaNs for any socioeconomic variable, we would lose valid observations for the variables where data were present. So, instead of dropping the rows that are missing socioeconomic variables, we will impute the missing values.
Data Imputation¶
Define the predictor variables to be rate_unemployment, percent_college, percent_married, median_income, percent_lunch, and state.
predictor_variables = [
'rate_unemployment', 'percent_college', 'percent_married', 'median_income', 'percent_lunch', 'state', 'charter'
]
Use the iterative imputer to replace missing values in the columns corresponding to predictor variables in the analysis.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
Fit the imputer using the numerical predictor variables (this can include dummies for categorical variables). Define the columns to use in the imputation process.
numerical_predictors = df[predictor_variables].select_dtypes(include='number').columns.to_list()
print(numerical_predictors)
['rate_unemployment', 'percent_college', 'percent_married', 'median_income', 'percent_lunch']
Let's fit the imputer on those numerical predictors
imputer.fit(df.loc[:, numerical_predictors])
IterativeImputer()
The imputer has now learned the relationships between these columns, so we can apply the transform method to actually replace the missing values
df.loc[:, numerical_predictors] = imputer.transform(df.loc[:, numerical_predictors])
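The fit/transform round trip can be sketched on a small array of hypothetical values; IterativeImputer estimates each missing entry from the other columns:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical unemployment-rate and income pairs with one missing income.
X = np.array([
    [0.06, 42000.0],
    [0.08, np.nan],
    [0.12, 36000.0],
    [0.05, 61000.0],
])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False -- every missing entry is filled
```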
Next, check for missing values to confirm that none remain in the data set
df.isna().sum().to_frame(name="Number of Missing Values")
| | Number of Missing Values |
|---|---|
| id | 0 |
| rate_unemployment | 0 |
| percent_college | 0 |
| percent_married | 0 |
| median_income | 0 |
| average_act | 0 |
| percent_lunch | 0 |
| year | 0 |
| state | 0 |
| zip_code | 0 |
| school_type | 0 |
| school_level | 0 |
| charter | 0 |
The data set is now clean and ready for analysis; save the clean data frame as a CSV file
df.to_csv(
'../data/education_clean.csv',
encoding='utf-8-sig',
index=False
)
The cleaned data set is saved to ../data/education_clean.csv
Exploratory data analysis¶
Plot the correlation matrix of the numerical variables in the data to explore relationships between the variables.
predictor_variables = [
'rate_unemployment',
'percent_college',
'percent_married',
'median_income',
'percent_lunch'
]
numerical_predictors = df[predictor_variables].select_dtypes(include='number').columns.to_list()
corr_matrix = df[numerical_predictors + ["average_act"]].corr()
plt.figure(figsize=(10, 5))
sns.heatmap(
corr_matrix, vmax=1, vmin=-1, square=True, annot=True, cmap="viridis"
)
plt.tick_params(labelsize=12)
plt.show()
We can see that all of the socioeconomic predictor variables have non-negligible correlation coefficients with the average ACT score. In fact, the correlation between percent lunch and average ACT is -0.78, which is quite large in magnitude. Next, let's make pair plots to explore relationships between the variables
fig = sns.pairplot(
    data=df,
    vars=numerical_predictors + ['average_act'],
    hue='charter',
    kind='reg',
    plot_kws={"scatter_kws": {"alpha": 0.5, "color": "k", "s": 7}},
)
for ax in fig.axes.flat:
    # The columns were renamed, so check against the new name.
    if ax.get_xlabel() == 'median_income':
        ax.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))  # scientific notation for income
    ax.set_xlabel(ax.get_xlabel(), fontsize=8, rotation=30, ha='right')  # x-axis label size and rotation
    ax.set_ylabel(ax.get_ylabel(), fontsize=8)  # y-axis label size
    # Rotate x-axis tick labels
    plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
plt.show()
As we can see in the bottom row of the pair plot, the relationships are quite similar whether or not a school is a charter school. We should also look at the form of the relationship between each socioeconomic variable and the average ACT score.
Identify outliers¶
We can use the interquartile range to identify outliers, which are also evident in boxplots of the data. Median income is on a very different scale than the other predictors, so we will make two plots to explore the data.
plt.figure(figsize=(10,3))
fractions = list(numerical_predictors)
fractions.remove('median_income')
sns.boxplot(data=df[fractions], color='k')
plt.ylabel('Proportion', fontsize=15)
plt.tick_params(labelsize=12)
plt.show()
This boxplot shows that unemployment rate, percent college, and percent married do have outliers. However, the values are not far from the bulk of the data, nor do they look like incorrect values, so we do not necessarily want to exclude them from the analysis.
plt.figure(figsize=(3,3))
sns.boxplot(data=df, y='median_income', color='k')
plt.ylabel('Median income', fontsize=15)
plt.show()
This boxplot shows that median income also has outliers, which is typical of income distributions. While technically outliers, these values do not look incorrect, but we should be mindful that they are present in the data.
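The interquartile-range rule behind these boxplot whiskers can be written out explicitly; a sketch on a toy income series (hypothetical values):

```python
import pandas as pd

income = pd.Series([36000, 42000, 47000, 54000, 61000, 226000])  # hypothetical

q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # [226000]
```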
Modeling¶
Single input models¶
Fit and assess models predicting the average ACT score from each of the input variables. We might also try polynomial regression models, as appropriate
Median income¶
Plot the regression line and the scatter plot
plt.figure(figsize=(6,6))
sns.regplot(data=df,
x='median_income',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Median income ($)', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between median income and the average ACT score, but this simple linear regression appears to provide only a moderate fit to the data. Let's fit the model and then assess it using graphical and numerical methods
model_median_income = smf.ols(formula='average_act ~ median_income', data=df).fit()
print(model_median_income.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.211
Model: OLS Adj. R-squared: 0.211
Method: Least Squares F-statistic: 1934.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:14 Log-Likelihood: -16043.
No. Observations: 7227 AIC: 3.209e+04
Df Residuals: 7225 BIC: 3.210e+04
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 17.8026 0.063 284.794 0.000 17.680 17.925
median_income 4.732e-05 1.08e-06 43.981 0.000 4.52e-05 4.94e-05
==============================================================================
Omnibus: 191.536 Durbin-Watson: 1.273
Prob(Omnibus): 0.000 Jarque-Bera (JB): 361.275
Skew: -0.197 Prob(JB): 3.55e-79
Kurtosis: 4.022 Cond. No. 1.39e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
This summary tells us that the intercept of the model is 17.8 and the coefficient on median income is 4.732e-05. The coefficient is very small because median income is measured in dollars and spans a large range, whereas ACT scores span a much smaller range. We are also interested in the statistical significance of the coefficients, particularly the one on our predictor: the P>|t| column shows p-values that are zero to three decimal places, so the coefficients are statistically significant. Let's compute the R-squared.
model_median_income.rsquared
np.float64(0.21118648979300614)
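To put the median_income coefficient on an interpretable scale, multiply it by a concrete income difference:

```python
coef = 4.732e-05  # coefficient on median_income from the fitted model above

# Predicted change in average ACT score for a $10,000 increase in median income.
delta_act = coef * 10_000
print(round(delta_act, 2))  # 0.47
```

So a $10,000 difference in census-tract median income corresponds to roughly half an ACT point in this model.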
Compute the RMSE
y_hat = model_median_income.predict()
np.sqrt(mean_squared_error(df['average_act'], y_hat)).round(3)
np.float64(2.228)
Compute the mean absolute error
mean_absolute_error(df['average_act'], y_hat)
1.7129386980688617
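Both metrics can also be computed directly from their definitions; because RMSE squares the errors, it weights large errors more heavily and is never smaller than MAE. A sketch on toy residuals (hypothetical values):

```python
import numpy as np

y_true = np.array([20.0, 19.0, 18.0, 21.0])
y_pred = np.array([19.0, 19.5, 20.0, 20.5])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error
print(round(rmse, 3), mae)  # 1.173 1.0
```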
The model is not perfect, but this is a relatively small error compared with the range of possible ACT scores, so we are able to predict the ACT score from this single input variable to a useful degree. We would also like to assess the model graphically; the standard approach is a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat, model_median_income.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
This plot is not purely a structureless cloud of points: there is a pattern in the residuals, the differences between the actual values and the predictions. That suggests an alternative model might improve on the simple linear regression, so we could try a quadratic polynomial regression model. Plot the regression curves and the scatter plot.
plt.figure(figsize=(6, 6))
sns.regplot(data=df,
x='median_income',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
sns.regplot(data=df,
x='median_income',
y='average_act',
order=2,
color='orange',
ci=False,
scatter=False
)
plt.xlabel("Median income ($)", fontsize=16)
plt.ylabel("Average ACT score", fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
The quadratic model might provide a slightly better fit, but it is not clear that it will be significantly better than the simple linear regression. Let's fit the quadratic model and then consider its accuracy and significance.
model_median_income2 = smf.ols( # type: ignore
formula='average_act ~ median_income + I(median_income**2)',
data=df
).fit()
print(model_median_income2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.219
Model: OLS Adj. R-squared: 0.219
Method: Least Squares F-statistic: 1013.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:15 Log-Likelihood: -16007.
No. Observations: 7227 AIC: 3.202e+04
Df Residuals: 7224 BIC: 3.204e+04
Df Model: 2
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 16.9460 0.118 143.790 0.000 16.715 17.177
median_income 7.63e-05 3.55e-06 21.485 0.000 6.93e-05 8.33e-05
I(median_income ** 2) -1.99e-10 2.33e-11 -8.557 0.000 -2.45e-10 -1.53e-10
==============================================================================
Omnibus: 186.698 Durbin-Watson: 1.302
Prob(Omnibus): 0.000 Jarque-Bera (JB): 395.543
Skew: -0.140 Prob(JB): 1.29e-86
Kurtosis: 4.111 Cond. No. 2.27e+10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.27e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
The summary shows that the R-squared is 0.219, very slightly higher than in the previous model. The coefficient on the squared term is statistically significant, although it is not clear how much it improves the model. So this analysis shows that we have a significant quadratic term as well as a significant linear term.
model_median_income2.pvalues
Intercept 0.000000e+00 median_income 2.631899e-99 I(median_income ** 2) 1.395848e-17 dtype: float64
We can use an analysis of variance (ANOVA) to compare these two nested polynomial regression models, testing whether the more complicated model is statistically significantly different from the simpler one.
sms.anova_lm(model_median_income, model_median_income2)
| df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| 0 | 7225.0 | 35865.012794 | 0.0 | NaN | NaN | NaN |
| 1 | 7224.0 | 35505.105960 | 1.0 | 359.906834 | 73.227974 | 1.395848e-17 |
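For nested models that differ by a single term, the ANOVA p-value is identical to the p-value on that term's coefficient (the F statistic is the square of the t statistic). A minimal sketch on made-up data, not the EdGap variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.anova as sms

# Synthetic data with a mild quadratic trend (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = 2 + 3 * x - 1.5 * x**2 + rng.normal(0, 0.5, 500)
toy = pd.DataFrame({'x': x, 'y': y})

linear = smf.ols('y ~ x', data=toy).fit()
quadratic = smf.ols('y ~ x + I(x**2)', data=toy).fit()

# F-test comparing the nested models vs. the t-test on the squared term
p_anova = sms.anova_lm(linear, quadratic)['Pr(>F)'].iloc[1]
p_term = quadratic.pvalues['I(x ** 2)']
print(p_anova, p_term)  # the two p-values agree
```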
The p-value is quite small, indicating a statistically significant difference between the two models. In fact, it is exactly the same as the p-value on the coefficient for the squared term. Let's look at the accuracy of the quadratic model using the mean absolute error.
mean_absolute_error(df['average_act'], model_median_income2.predict())
1.6972389257968241
mean_absolute_error(df['average_act'], model_median_income.predict())
1.7129386980688617
The mean absolute error is 1.697, which is smaller than for the first model, but not practically smaller. This shows that we can take one socioeconomic predictor variable and formulate a model that provides some predictive power for the average ACT score, but the prediction is relatively weak. We have also seen that a linear model is probably sufficient, and that a quadratic term is not necessarily going to provide a much better fit.
Rate Unemployment¶
Plot the regression line and the scatter plot
plt.figure(figsize=(6,6))
sns.regplot(data=df,
x='rate_unemployment',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Rate Unemployment (%)', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between the unemployment rate and the average ACT score, and the simple linear regression appears to provide only a moderate fit of the data. Let's fit the model and then assess it using graphical and numerical methods.
model_rate_unemployment = smf.ols(formula='average_act ~ rate_unemployment', data=df).fit()
print(model_rate_unemployment.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.188
Model: OLS Adj. R-squared: 0.188
Method: Least Squares F-statistic: 1669.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:15 Log-Likelihood: -16149.
No. Observations: 7227 AIC: 3.230e+04
Df Residuals: 7225 BIC: 3.232e+04
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 22.1512 0.053 421.397 0.000 22.048 22.254
rate_unemployment -19.2070 0.470 -40.859 0.000 -20.129 -18.286
==============================================================================
Omnibus: 140.855 Durbin-Watson: 1.378
Prob(Omnibus): 0.000 Jarque-Bera (JB): 287.582
Skew: 0.071 Prob(JB): 3.57e-63
Kurtosis: 3.967 Cond. No. 17.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells us that the intercept in the model is 22.15 and the coefficient on the unemployment rate is -19.21. We are also interested in the statistical significance of the coefficients, particularly the one on our predictor; the p-values are small (zero to three decimal places), so the coefficients are statistically significant.
Compute the RMSE
y_hat_unemployment = model_rate_unemployment.predict()
np.sqrt(mean_squared_error(df['average_act'], y_hat_unemployment)).round(3)
np.float64(2.261)
From the computation above, the model is not perfect, but the error is relatively small compared with the range of possible ACT scores. If we compare this with the median income model, whose RMSE is about 2.228, we see that median income is a slightly better predictor of the ACT score than the unemployment rate. We would also like to assess the model graphically using a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat_unemployment, model_rate_unemployment.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
plt.figure(figsize=(6, 6))
sns.regplot(data=df,
x='rate_unemployment',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
sns.regplot(data=df,
x='rate_unemployment',
y='average_act',
order=2,
color='orange',
ci=False,
scatter=False
)
plt.xlabel("Rate Unemployment (%)", fontsize=16)
plt.ylabel("Average ACT score", fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
model_rate_unemployment2 = smf.ols( # type: ignore
formula='average_act ~ rate_unemployment + I(rate_unemployment**2)',
data=df
).fit()
print(model_rate_unemployment2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.193
Model: OLS Adj. R-squared: 0.193
Method: Least Squares F-statistic: 865.0
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:16 Log-Likelihood: -16125.
No. Observations: 7227 AIC: 3.226e+04
Df Residuals: 7224 BIC: 3.228e+04
Df Model: 2
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
Intercept 22.5949 0.082 275.312 0.000 22.434 22.756
rate_unemployment -27.5005 1.270 -21.648 0.000 -29.991 -25.010
I(rate_unemployment ** 2) 28.4906 4.056 7.024 0.000 20.539 36.442
==============================================================================
Omnibus: 133.726 Durbin-Watson: 1.377
Prob(Omnibus): 0.000 Jarque-Bera (JB): 270.419
Skew: 0.059 Prob(JB): 1.90e-59
Kurtosis: 3.940 Cond. No. 160.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The summary shows that the R-squared is 0.193, very slightly higher than in the previous model. The coefficient on the squared term is statistically significant, although it is not clear how much it improves the model. So this analysis shows that we have a significant quadratic term as well as a significant linear term.
model_rate_unemployment2.pvalues
Intercept 0.000000e+00 rate_unemployment 9.628233e-101 I(rate_unemployment ** 2) 2.360918e-12 dtype: float64
sms.anova_lm(model_rate_unemployment, model_rate_unemployment2)
| df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| 0 | 7225.0 | 36932.894308 | 0.0 | NaN | NaN | NaN |
| 1 | 7224.0 | 36682.406013 | 1.0 | 250.488295 | 49.329574 | 2.360918e-12 |
mean_absolute_error(df['average_act'], model_rate_unemployment2.predict())
1.7381981457701847
mean_absolute_error(df['average_act'], model_rate_unemployment.predict())
1.7421893796735084
The quadratic model technically improves the fit, but the improvement is tiny: R-squared rises from 0.188 to 0.193. The ANOVA p-value is quite small, indicating a statistically significant difference, and it is in fact exactly the same as the p-value on the coefficient for the squared term. The mean absolute error is slightly better than in the linear model, but not practically so. Therefore, the linear model is sufficient for predicting the ACT score, and the quadratic term is not necessary.
Percent College¶
Plot the regression line and the scatter plot
plt.figure(figsize=(6,6))
sns.regplot(data=df,
x='percent_college',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Percent College (%)', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between the percent of adults with a college degree and the average ACT score, and the simple linear regression appears to provide only a moderate fit of the data. Let's fit the model and then assess it using graphical and numerical methods.
model_percent_college = smf.ols(formula='average_act ~ percent_college', data=df).fit()
print(model_percent_college.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.210
Model: OLS Adj. R-squared: 0.210
Method: Least Squares F-statistic: 1922.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:17 Log-Likelihood: -16048.
No. Observations: 7227 AIC: 3.210e+04
Df Residuals: 7225 BIC: 3.211e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 16.3039 0.095 171.939 0.000 16.118 16.490
percent_college 6.9715 0.159 43.837 0.000 6.660 7.283
==============================================================================
Omnibus: 354.224 Durbin-Watson: 1.208
Prob(Omnibus): 0.000 Jarque-Bera (JB): 525.820
Skew: -0.443 Prob(JB): 6.60e-115
Kurtosis: 3.980 Cond. No. 8.10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells us that the intercept (the predicted score when percent_college is 0) is 16.304, and that when the fraction of adults with a college degree increases by 0.01, the average ACT score increases by about 0.0697. The R-squared shows that about 21% of the variation in ACT scores can be explained by percent_college. The p-values are very small (zero to three decimal places), so the coefficients are statistically significant.
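The slope interpretation above can be verified with quick arithmetic, using the coefficients reported in the fitted summary:

```python
# Coefficients taken from the fitted summary above
intercept, slope = 16.3039, 6.9715

# Predicted average ACT score at two values of percent_college 0.01 apart
act_low = intercept + slope * 0.50
act_high = intercept + slope * 0.51
print(round(act_high - act_low, 4))  # about 0.0697
```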
Compute the RMSE
y_hat_percent_college = model_percent_college.predict()
np.sqrt(mean_squared_error(df['average_act'], y_hat_percent_college)).round(3)
np.float64(2.229)
Compute the mean absolute error
mean_absolute_error(df['average_act'], y_hat_percent_college)
1.7169024235889725
From the computations above, the model is not perfect, but the error is relatively small compared with the range of possible ACT scores. This tells us that we are able, to some degree, to predict the ACT score from this single input variable. We would also like to assess the model graphically using a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat_percent_college, model_percent_college.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
This plot is a cloud of points randomly scattered around zero. Areas with more college-educated adults tend to have higher average ACT scores, and a linear relationship describes the trend well. Therefore, the linear regression model is sufficient for predicting the ACT score.
Percent Married¶
Plot the regression line and the scatter plot
plt.figure(figsize=(6,6))
sns.regplot(data=df,
x='percent_married',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Percent Married (%)', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between the percent of children living in married-couple families and the average ACT score, and the simple linear regression appears to provide only a moderate fit of the data. Let's fit the model and then assess it using graphical and numerical methods.
model_percent_married = smf.ols(formula='average_act ~ percent_married', data=df).fit()
print(model_percent_married.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.193
Model: OLS Adj. R-squared: 0.193
Method: Least Squares F-statistic: 1733.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:17 Log-Likelihood: -16124.
No. Observations: 7227 AIC: 3.225e+04
Df Residuals: 7225 BIC: 3.227e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 16.6046 0.093 179.296 0.000 16.423 16.786
percent_married 5.7686 0.139 41.628 0.000 5.497 6.040
==============================================================================
Omnibus: 183.522 Durbin-Watson: 1.418
Prob(Omnibus): 0.000 Jarque-Bera (JB): 369.094
Skew: 0.160 Prob(JB): 7.12e-81
Kurtosis: 4.060 Cond. No. 7.43
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells us that the intercept (the predicted score when percent_married is 0) is 16.604, and that when the fraction of children in married-couple families increases by 0.01, the average ACT score increases by about 0.058. The R-squared shows that about 19% of the variation in ACT scores can be explained by percent_married. The p-values are very small (zero to three decimal places).
Compute the RMSE
y_hat_percent_married = model_percent_married.predict()
np.sqrt(mean_squared_error(df['average_act'], y_hat_percent_married)).round(3)
np.float64(2.253)
Compute the mean absolute error
mean_absolute_error(df['average_act'], y_hat_percent_married)
1.7361820917022404
From the computations above, the model is not perfect, but the error is relatively small compared with the range of possible ACT scores. This tells us that we are able, to some degree, to predict the ACT score from this single input variable. We would also like to assess the model graphically using a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat_percent_married, model_percent_married.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
This plot is a cloud of points randomly scattered around zero. Areas where more children live in married-couple families tend to have higher average ACT scores, and a linear relationship describes the trend well. Therefore, the linear regression model is sufficient for predicting the ACT score.
Percent Lunch¶
Plot the regression line and the scatter plot
plt.figure(figsize=(6,6))
sns.regplot(data=df,
x='percent_lunch',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Percent students with free/reduced lunch (%)', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between the percent of students receiving free or reduced-price lunch and the average ACT score, and the simple linear regression appears to provide only a moderate fit of the data. Let's fit the model and then assess it using graphical and numerical methods.
model_percent_lunch = smf.ols(formula='average_act ~ percent_lunch', data=df).fit()
print(model_percent_lunch.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.614
Model: OLS Adj. R-squared: 0.614
Method: Least Squares F-statistic: 1.149e+04
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:18 Log-Likelihood: -13461.
No. Observations: 7227 AIC: 2.693e+04
Df Residuals: 7225 BIC: 2.694e+04
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 23.7429 0.037 641.745 0.000 23.670 23.815
percent_lunch -8.3902 0.078 -107.185 0.000 -8.544 -8.237
==============================================================================
Omnibus: 842.406 Durbin-Watson: 1.472
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2845.416
Skew: 0.582 Prob(JB): 0.00
Kurtosis: 5.845 Cond. No. 5.02
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells us that the intercept (the predicted score when percent_lunch is 0) is 23.74, and that when the fraction of students receiving free or reduced-price lunch increases by 0.01, the average ACT score decreases by about 0.084. The R-squared shows that about 61% of the variation in ACT scores can be explained by percent_lunch. The p-values are very small (zero to three decimal places).
Compute the RMSE
y_hat_percent_lunch = model_percent_lunch.predict()
np.sqrt(mean_squared_error(df['average_act'], y_hat_percent_lunch)).round(3)
np.float64(1.559)
Compute the mean absolute error
mean_absolute_error(df['average_act'], y_hat_percent_lunch)
1.1689939374388465
From the computations above, the error is relatively small compared with the range of possible ACT scores, and noticeably smaller than in the previous single-predictor models. This tells us that we can predict the ACT score from this input variable reasonably well. We would also like to assess the model graphically using a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat_percent_lunch, model_percent_lunch.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
The model using percent_lunch as a predictor is both statistically strong and practically meaningful. The linear regression captures a clear, negative relationship between socioeconomic disadvantage and ACT performance, explains most of the score variation, and produces unbiased residuals.
Multiple linear regression¶
We are now going to fit a multiple linear regression model that includes all of the socioeconomic predictor variables.
model = smf.ols(
formula='average_act ~ rate_unemployment + percent_college + percent_married + median_income + percent_lunch',
data=df).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.628
Model: OLS Adj. R-squared: 0.628
Method: Least Squares F-statistic: 2438.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:18 Log-Likelihood: -13328.
No. Observations: 7227 AIC: 2.667e+04
Df Residuals: 7221 BIC: 2.671e+04
Df Model: 5
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 22.6972 0.138 164.974 0.000 22.427 22.967
rate_unemployment -2.2689 0.404 -5.614 0.000 -3.061 -1.477
percent_college 1.7387 0.158 11.039 0.000 1.430 2.047
percent_married -0.0747 0.134 -0.558 0.577 -0.337 0.188
median_income -1.212e-07 1.21e-06 -0.100 0.920 -2.49e-06 2.25e-06
percent_lunch -7.6045 0.097 -78.542 0.000 -7.794 -7.415
==============================================================================
Omnibus: 870.700 Durbin-Watson: 1.483
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3129.508
Skew: 0.584 Prob(JB): 0.00
Kurtosis: 6.005 Cond. No. 1.34e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
We see that two of our model coefficients are not statistically significant: the coefficients on percent_married and median_income have relatively large p-values. We noticed in our exploratory data analysis that there were correlations among these predictor variables, so it makes sense that we do not need all of them in a single model. We also note that the R-squared is roughly 0.63, which is much higher than for any of the individual-predictor models. We can use a residual plot for a graphical assessment of the model fit, to see whether the predictor variables have been used in the best way possible or whether some transformation might improve the fit.
y_hat = model.predict()
plt.figure(figsize=(6, 6))
plt.plot(y_hat, model.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
This plot shows essentially no structure in the residuals, indicating that we are unlikely to improve the model by transforming the input variables. We can also do a numerical assessment of the model's accuracy by looking at the mean absolute error.
mean_absolute_error(df['average_act'], model.predict())
1.1453304788871002
This is less than what we saw with the individual-predictor models. Having fit the full multiple linear regression model with all of the socioeconomic predictor variables, and found that some of the coefficients were not statistically significant, we will now fit a reduced model that includes only the predictors with statistically significant coefficients, and see whether that reduced model is sufficient to predict the average ACT score. In this case, we include only the unemployment rate, percent college, and the percent of students eligible for free or reduced-price lunch.
model_reduced = smf.ols(
formula='average_act ~ rate_unemployment + percent_college + percent_lunch',
data=df).fit()
print(model_reduced.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.628
Model: OLS Adj. R-squared: 0.628
Method: Least Squares F-statistic: 4063.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:18 Log-Likelihood: -13328.
No. Observations: 7227 AIC: 2.666e+04
Df Residuals: 7223 BIC: 2.669e+04
Df Model: 3
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 22.6400 0.102 220.905 0.000 22.439 22.841
rate_unemployment -2.1683 0.374 -5.801 0.000 -2.901 -1.436
percent_college 1.7139 0.127 13.504 0.000 1.465 1.963
percent_lunch -7.5863 0.093 -81.992 0.000 -7.768 -7.405
==============================================================================
Omnibus: 873.251 Durbin-Watson: 1.483
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3129.902
Skew: 0.587 Prob(JB): 0.00
Kurtosis: 6.003 Cond. No. 25.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
All three of these variables have statistically significant coefficients. We can use a residual plot for a graphical assessment of this model's fit; it is very similar to the residual plot for the full model.
y_hat_reduced = model_reduced.predict()
plt.figure(figsize=(6, 6))
plt.plot(y_hat_reduced, model_reduced.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is again essentially no structure in the residuals, indicating that we do not need to include transformations of these particular predictors to improve the model. We can again assess the accuracy numerically using the mean absolute error.
mean_absolute_error(df['average_act'], model_reduced.predict())
1.1454832950346427
This looks very similar to the full model. In fact, when we compare the accuracy of the reduced model to the full model with all five predictors, in terms of mean absolute error or R-squared, they are essentially equivalent.
mae_full = mean_absolute_error(df['average_act'], model.predict())
mae_reduced = mean_absolute_error(df['average_act'], model_reduced.predict())
r2_full = model.rsquared
r2_reduced = model_reduced.rsquared
pd.DataFrame({'Mean Absolute Error': [mae_full, mae_reduced],
'R-squared': [r2_full, r2_reduced]},
index=['full model', 'reduced model']).round(4)
| Mean Absolute Error | R-squared | |
|---|---|---|
| full model | 1.1453 | 0.6280 |
| reduced model | 1.1455 | 0.6279 |
We can also assess the significance of the difference between the two models using an ANOVA.
sms.anova_lm(model_reduced, model)
| df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| 0 | 7223.0 | 16916.619167 | 0.0 | NaN | NaN | NaN |
| 1 | 7221.0 | 16915.612457 | 2.0 | 1.006711 | 0.214874 | 0.806648 |
We can see that there is no statistically significant difference between the reduced model with three predictors and the larger model with five socioeconomic predictors. Next, we will scale the predictors in the reduced model so that they have a mean of 0 and a standard deviation of 1. This lets us use the magnitudes of the coefficients to compare the relative importance of each predictor's contribution to the estimate of the average ACT score.
predictor_variables = ['rate_unemployment', 'percent_college', 'percent_lunch']
scaled_columns = [var + '_normalized' for var in predictor_variables]
print(scaled_columns)
['rate_unemployment_normalized', 'percent_college_normalized', 'percent_lunch_normalized']
scaler = StandardScaler().fit(df[predictor_variables])
df[scaled_columns] = scaler.transform(df[predictor_variables])
Now let's check that these new variables do in fact have a mean of 0 and a standard deviation of 1. We might expect small deviations due to numerical error, but the means are essentially zero and the standard deviations are one.
df[scaled_columns].agg(['mean', 'std']).round(3)
| rate_unemployment_normalized | percent_college_normalized | percent_lunch_normalized | |
|---|---|---|---|
| mean | 0.0 | -0.0 | -0.0 |
| std | 1.0 | 1.0 | 1.0 |
Now we can fit the multiple linear regression model with the normalized predictors. We again use the three predictors from the reduced model, all of which had statistically significant coefficients.
model_normalize = smf.ols(
formula='average_act ~ rate_unemployment_normalized + percent_college_normalized + percent_lunch_normalized',
data=df).fit()
print(model_normalize.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.628
Model: OLS Adj. R-squared: 0.628
Method: Least Squares F-statistic: 4063.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:18 Log-Likelihood: -13328.
No. Observations: 7227 AIC: 2.666e+04
Df Residuals: 7223 BIC: 2.669e+04
Df Model: 3
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 20.2986 0.018 1127.578 0.000 20.263 20.334
rate_unemployment_normalized -0.1227 0.021 -5.801 0.000 -0.164 -0.081
percent_college_normalized 0.2826 0.021 13.504 0.000 0.242 0.324
percent_lunch_normalized -1.7770 0.022 -81.992 0.000 -1.819 -1.734
==============================================================================
Omnibus: 873.251 Durbin-Watson: 1.483
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3129.902
Skew: 0.587 Prob(JB): 0.00
Kurtosis: 6.003 Cond. No. 1.93
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We see that the coefficient on the percent lunch variable has a much larger magnitude than either of the other two. This means the estimated change in the average ACT score for a one standard deviation change in percent lunch is much larger in magnitude than for a one standard deviation change in percent college or the unemployment rate. Note that the R-squared looks identical to that of the reduced model. Let's compare the accuracy of the original and normalized models.
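One convenient way to read off this ordering is to rank the standardized coefficients (taken from the summary above, intercept excluded) by absolute magnitude:

```python
import pandas as pd

# Standardized coefficients copied from the summary above (intercept excluded)
coefs = pd.Series({'rate_unemployment': -0.1227,
                   'percent_college': 0.2826,
                   'percent_lunch': -1.7770})

# percent_lunch dominates: its one-SD effect is several times the others'
print(coefs.abs().sort_values(ascending=False))
```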
mae_normalized = mean_absolute_error(df['average_act'], model_normalize.predict())
mae_reduced = mean_absolute_error(df['average_act'], model_reduced.predict())
r2_normalized = model_normalize.rsquared
r2_reduced = model_reduced.rsquared
pd.DataFrame({'Mean Absolute Error': [mae_normalized, mae_reduced],
'R-squared': [r2_normalized, r2_reduced]},
index=['normalized model', 'reduced model']).round(4)
|  | Mean Absolute Error | R-squared |
|---|---|---|
| normalized model | 1.1455 | 0.6279 |
| reduced model | 1.1455 | 0.6279 |
In fact, these numerical measures, the mean absolute error and the R-squared, are exactly the same for the model with unscaled predictors and the model with scaled predictors. This is because standardizing the predictors does not change the overall linear regression model; it only rescales the coefficients.
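This invariance can be checked directly on synthetic data (a minimal sketch, not the project data set): standardizing a predictor rescales its coefficient by the predictor's standard deviation but leaves the fitted values, and hence R-squared and MAE, unchanged.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=200)})
toy['y'] = 3.0 * toy['x'] + rng.normal(size=200)
toy['x_norm'] = (toy['x'] - toy['x'].mean()) / toy['x'].std()

m_raw = smf.ols('y ~ x', data=toy).fit()
m_std = smf.ols('y ~ x_norm', data=toy).fit()

# Predictions agree to numerical precision; only the coefficient rescales.
print(np.allclose(m_raw.predict(), m_std.predict()))
print(np.isclose(m_std.params['x_norm'], m_raw.params['x'] * toy['x'].std()))
```

Standardization is an invertible linear reparameterization of the design matrix, which is why the least squares fit itself cannot change.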
Now we will load a new data set to obtain an additional variable.
new_school_information = pd.read_csv('../data/School_Neighborhood_Poverty_Estimates%2C_2016-17.csv', encoding='unicode_escape')
Let's explore the contents of the new data set to see the names of the columns and a few example values for each column. We can also check whether the data is in tidy format.
new_school_information.head()
|  | X | Y | NCESSCH | NAME | IPR_EST | IPR_SE | OBJECTID | LAT1617 | LON1617 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -86.628760 | 33.673667 | 10000200277 | Sequoyah Sch - Chalkville Campus | 252 | 112 | 1 | 33.673661 | -86.628755 |
| 1 | -86.532753 | 32.519175 | 10000201667 | Camps | 217 | 71 | 2 | 32.519169 | -86.532748 |
| 2 | -87.750169 | 31.937797 | 10000201670 | Det Ctr | 290 | 77 | 3 | 31.937791 | -87.750164 |
| 3 | -86.083210 | 32.375712 | 10000201705 | Wallace Sch - Mt Meigs Campus | 267 | 78 | 4 | 32.375706 | -86.083205 |
| 4 | -86.710585 | 33.586713 | 10000201706 | McNeel Sch - Vacca Campus | 172 | 56 | 5 | 33.586707 | -86.710580 |
Use the info method to check the data types, size of the data frame, and numbers of missing values
new_school_information.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100623 entries, 0 to 100622
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   X         100623 non-null  float64
 1   Y         100623 non-null  float64
 2   NCESSCH   100623 non-null  int64  
 3   NAME      100623 non-null  object 
 4   IPR_EST   100623 non-null  int64  
 5   IPR_SE    100623 non-null  int64  
 6   OBJECTID  100623 non-null  int64  
 7   LAT1617   100623 non-null  float64
 8   LON1617   100623 non-null  float64
dtypes: float64(4), int64(4), object(1)
memory usage: 6.9+ MB
From the information above, we can see that there are no missing values. Next, we check for duplicated rows to guard against incorrect values.
new_school_information.duplicated().sum()
np.int64(0)
The new school information data set contains many unrelated columns, so we keep only NCESSCH, IPR_EST, and IPR_SE.
new_school_information = new_school_information[['NCESSCH', 'IPR_EST', 'IPR_SE']]
new_school_information.head()
|  | NCESSCH | IPR_EST | IPR_SE |
|---|---|---|---|
| 0 | 10000200277 | 252 | 112 |
| 1 | 10000201667 | 217 | 71 |
| 2 | 10000201670 | 290 | 77 |
| 3 | 10000201705 | 267 | 78 |
| 4 | 10000201706 | 172 | 56 |
We will rename the columns to follow best practices of style and being informative. We will do it before joining data sets to make it obvious that the key has the same name in each data set.
new_school_information = new_school_information.rename(
columns={
'NCESSCH': 'id',
'IPR_EST': 'income_poverty_ratio_estimate',
'IPR_SE': 'income_poverty_se'
}
)
new_school_information.head()
|  | id | income_poverty_ratio_estimate | income_poverty_se |
|---|---|---|---|
| 0 | 10000200277 | 252 | 112 |
| 1 | 10000201667 | 217 | 71 |
| 2 | 10000201670 | 290 | 77 |
| 3 | 10000201705 | 267 | 78 |
| 4 | 10000201706 | 172 | 56 |
We want to join the DataFrames using the identity of the school as the key, given by the NCESSCH school ID.
This column is an int64 in the new school information data set and an object in the current data set.
We will cast the id column of the new school information to object so that the key has the same type in both DataFrames.
new_school_information['id'] = new_school_information['id'].astype('object')
new_school_information.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100623 entries, 0 to 100622
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   id                             100623 non-null  object
 1   income_poverty_ratio_estimate  100623 non-null  int64 
 2   income_poverty_se              100623 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 2.3+ MB
new_df = df.merge(
new_school_information,
how='left',
on='id'
)
new_df.head()
|  | id | rate_unemployment | percent_college | percent_married | median_income | average_act | percent_lunch | year | state | zip_code | school_type | school_level | charter | rate_unemployment_normalized | percent_college_normalized | percent_lunch_normalized | income_poverty_ratio_estimate | income_poverty_se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001600143 | 0.117962 | 0.445283 | 0.346495 | 42820.0 | 20.433455 | 0.066901 | 2016-2017 | DE | 19804 | Regular School | High | Yes | 0.380125 | -0.774475 | -1.466983 | 258 | 92 |
| 1 | 100008000024 | 0.063984 | 0.662765 | 0.767619 | 89320.0 | 19.498168 | 0.112412 | 2016-2017 | DE | 19709 | Regular School | High | No | -0.573936 | 0.544280 | -1.272689 | 527 | 90 |
| 2 | 100008000225 | 0.056460 | 0.701864 | 0.713090 | 84140.0 | 19.554335 | 0.096816 | 2016-2017 | DE | 19709 | Regular School | High | No | -0.706931 | 0.781372 | -1.339271 | 399 | 119 |
| 3 | 100017000029 | 0.044739 | 0.692062 | 0.641283 | 56500.0 | 17.737485 | 0.296960 | 2016-2017 | DE | 19958 | Regular School | High | No | -0.914109 | 0.721933 | -0.484817 | 397 | 81 |
| 4 | 100018000040 | 0.077014 | 0.640060 | 0.834402 | 54015.0 | 18.245421 | 0.262641 | 2016-2017 | DE | 19934 | Regular School | High | No | -0.343648 | 0.406606 | -0.631332 | 281 | 93 |
new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7227 entries, 0 to 7226
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             7227 non-null   object 
 1   rate_unemployment              7227 non-null   float64
 2   percent_college                7227 non-null   float64
 3   percent_married                7227 non-null   float64
 4   median_income                  7227 non-null   float64
 5   average_act                    7227 non-null   float64
 6   percent_lunch                  7227 non-null   float64
 7   year                           7227 non-null   object 
 8   state                          7227 non-null   object 
 9   zip_code                       7227 non-null   object 
 10  school_type                    7227 non-null   object 
 11  school_level                   7227 non-null   object 
 12  charter                        7227 non-null   object 
 13  rate_unemployment_normalized   7227 non-null   float64
 14  percent_college_normalized     7227 non-null   float64
 15  percent_lunch_normalized       7227 non-null   float64
 16  income_poverty_ratio_estimate  7227 non-null   int64  
 17  income_poverty_se              7227 non-null   int64  
dtypes: float64(9), int64(2), object(7)
memory usage: 1016.4+ KB
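A quick way to verify join coverage is pandas' merge `indicator` flag, which labels each row by which frame it came from. A minimal sketch on toy frames (hypothetical ids, not the project data):

```python
import pandas as pd

left = pd.DataFrame({'id': ['a', 'b', 'c'], 'score': [20.1, 19.5, 18.7]})
right = pd.DataFrame({'id': ['a', 'b'], 'ipr': [252, 217]})

merged = left.merge(right, how='left', on='id', indicator=True)
# Rows labeled 'left_only' are keys with no match in the lookup table;
# their joined columns will be NaN.
print(merged['_merge'].value_counts())
```

A count of zero `left_only` rows would confirm that every school id found a match, which is what the missing-value check below also shows.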
Now we have a merged data frame containing one more variable for analysis. Let's check how many values of each variable are missing after adding the new variable.
new_df.isna().sum().to_frame(name="Number of Missing Values")
|  | Number of Missing Values |
|---|---|
| id | 0 |
| rate_unemployment | 0 |
| percent_college | 0 |
| percent_married | 0 |
| median_income | 0 |
| average_act | 0 |
| percent_lunch | 0 |
| year | 0 |
| state | 0 |
| zip_code | 0 |
| school_type | 0 |
| school_level | 0 |
| charter | 0 |
| rate_unemployment_normalized | 0 |
| percent_college_normalized | 0 |
| percent_lunch_normalized | 0 |
| income_poverty_ratio_estimate | 0 |
| income_poverty_se | 0 |
We can see that the new variable has no missing values. The data set is now clean and ready for further analysis, so we save the clean data frame as a CSV file as a second version.
new_df.to_csv(
'../data/education_clean_v2.csv',
encoding='utf-8-sig',
index=False
)
The cleaned data set is saved to ../data/education_clean_v2.csv.
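One caveat when reading the file back: columns stored as digit-like strings (such as id and zip_code) are parsed as integers by default, which silently drops leading zeros. A sketch of the round trip on a toy frame (hypothetical value, not the project file):

```python
import os
import tempfile

import pandas as pd

# A digit-like string with a leading zero, standing in for an NCESSCH id.
toy = pd.DataFrame({'id': ['0100001600143'], 'average_act': [20.4]})
path = os.path.join(tempfile.mkdtemp(), 'toy.csv')
toy.to_csv(path, index=False)

naive = pd.read_csv(path)                      # id parsed as int64, leading zero lost
pinned = pd.read_csv(path, dtype={'id': str})  # id preserved as a string
print(naive['id'].dtype, pinned['id'].iloc[0])
```

Passing `dtype={'id': str}` on reload keeps the key compatible with the object-typed ids used for merging above.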
Plot the correlation matrix of the numerical variables to explore relationships between them.
predictor_variables = [
'rate_unemployment',
'percent_college',
'percent_married',
'median_income',
'percent_lunch',
'income_poverty_ratio_estimate'
]
numerical_predictors = new_df[predictor_variables].select_dtypes(include='number').columns.to_list()
corr_matrix = new_df[numerical_predictors + ["average_act"]].corr()
plt.figure(figsize=(10, 5))
sns.heatmap(
corr_matrix, vmax=1, vmin=-1, square=True, annot=True, cmap="viridis"
)
plt.tick_params(labelsize=12)
plt.show()
We can see that the correlation between income_poverty_ratio_estimate and average_act is 0.52, the largest in magnitude among the positively correlated predictors. Next, let's make pair plots to explore relationships between the variables.
fig = sns.pairplot(
data=new_df,
vars=numerical_predictors + ['average_act'],
hue='charter',
kind='reg',
plot_kws={"scatter_kws": {"alpha": 0.5, "color": "k", "s": 7},},
)
for ax in fig.axes.flat:
    if ax.get_xlabel() == 'median_income':  # column was renamed from 'CT Median Household Income'
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0)) # Apply scientific notation
ax.set_xlabel(ax.get_xlabel(), fontsize=8, rotation=30, ha='right') # X-axis label size and rotation
ax.set_ylabel(ax.get_ylabel(), fontsize=8) # Y-axis label size
# Rotate x-axis tick labels
plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
plt.show()
We can use the interquartile range to identify outliers. This is also evident in a boxplot of income_poverty_ratio_estimate.
plt.figure(figsize=(3,3))
sns.boxplot(data=new_df[['income_poverty_ratio_estimate']], color='k')
plt.ylabel('Income poverty ratio estimate', fontsize=15)
plt.tick_params(labelsize=12)
plt.show()
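The IQR fences the boxplot whiskers use can also be computed explicitly to count the flagged points: values beyond 1.5 times the IQR from the quartiles are outliers. A sketch on synthetic data standing in for the project column (not the real values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic values with a few injected high extremes, mimicking a right-skewed ratio.
vals = pd.Series(np.concatenate([rng.normal(250, 60, 500), [900, 1100, 1300]]))

q1, q3 = vals.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag everything outside the Tukey fences.
outliers = vals[(vals < lower) | (vals > upper)]
print(len(outliers))  # at least the three injected extremes
```

This is the same 1.5·IQR rule that seaborn's boxplot applies when deciding which points to draw individually.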
The boxplot shows that income_poverty_ratio_estimate contains several outliers. These represent schools in neighborhoods with income-to-poverty ratios far higher than most others, suggesting a distribution skewed toward higher-income areas. Let's fit and assess a model predicting the average ACT score from the income poverty ratio estimate.
plt.figure(figsize=(6,6))
sns.regplot(data=new_df,
x='income_poverty_ratio_estimate',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
plt.xlabel('Income Poverty Ratio Estimate', fontsize=16)
plt.ylabel('Average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
There is a relationship between the income poverty ratio estimate and the average ACT score, but this simple linear regression appears to provide only a moderate fit to the data. Let's fit the model and then assess it using graphical and numerical methods.
model_income_poverty_estimate = smf.ols(formula='average_act ~ income_poverty_ratio_estimate', data=new_df).fit()
print(model_income_poverty_estimate.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.275
Model: OLS Adj. R-squared: 0.275
Method: Least Squares F-statistic: 2740.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:36 Log-Likelihood: -15739.
No. Observations: 7227 AIC: 3.148e+04
Df Residuals: 7225 BIC: 3.149e+04
Df Model: 1
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Intercept 17.6586 0.056 313.406 0.000 17.548 17.769
income_poverty_ratio_estimate 0.0088 0.000 52.348 0.000 0.008 0.009
==============================================================================
Omnibus: 404.652 Durbin-Watson: 1.259
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1095.156
Skew: -0.299 Prob(JB): 1.55e-238
Kurtosis: 4.811 Cond. No. 751.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This summary tells us that the intercept is 17.66 and the coefficient on income_poverty_ratio_estimate is 0.0088. The coefficient is small per unit, though the predictor is measured on a scale of hundreds. We are also interested in the statistical significance of the coefficients, particularly the one on our predictor; the p-values are zero to three decimal places, so the coefficients are statistically significant. Let's compute the R-squared.
model_income_poverty_estimate.rsquared
np.float64(0.2749858672251153)
The R-squared shows that about 27.5% of the variance in the average ACT score can be explained by income_poverty_ratio_estimate. Let's compute the RMSE and the mean absolute error.
y_hat_income_poverty = model_income_poverty_estimate.predict()
np.sqrt(mean_squared_error(new_df['average_act'], y_hat_income_poverty)).round(3)
np.float64(2.136)
mean_absolute_error(new_df['average_act'], y_hat_income_poverty)
1.619754091555486
Although the model is far from perfect, this is a relatively small error compared to the range of possible ACT scores, so we can predict the ACT score to some extent from this single input variable. We would also like to assess the model graphically; the standard approach is a residual plot.
plt.figure(figsize=(6, 6))
plt.plot(y_hat_income_poverty, model_income_poverty_estimate.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
From the plot we can see a possible downward trend as the predicted ACT score increases, suggesting the model may overpredict for high ACT scores. Since the plot is not purely a cloud of points, we might try a more complicated model, such as a quadratic polynomial regression.
plt.figure(figsize=(6, 6))
sns.regplot(data=new_df,
x='income_poverty_ratio_estimate',
y='average_act',
color='blue',
ci=False,
scatter_kws={'color': 'black', 'edgecolors': 'white', 'linewidths': 1}
)
sns.regplot(data=new_df,
x='income_poverty_ratio_estimate',
y='average_act',
order=2,
color='orange',
ci=False,
scatter=False
)
plt.xlabel("Income Poverty Ratio Estimate", fontsize=16)
plt.ylabel("Average ACT score", fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
The quadratic model might provide a slightly better fit, but it is not clear that it will be significantly better than the simple linear regression. We should fit the model and then consider the accuracy and the significance of the quadratic term.
model_income_poverty_estimate2 = smf.ols( # type: ignore
formula='average_act ~ income_poverty_ratio_estimate + I(income_poverty_ratio_estimate**2)',
data=new_df
).fit()
print(model_income_poverty_estimate2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.313
Model: OLS Adj. R-squared: 0.313
Method: Least Squares F-statistic: 1649.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:37 Log-Likelihood: -15542.
No. Observations: 7227 AIC: 3.109e+04
Df Residuals: 7224 BIC: 3.111e+04
Df Model: 2
Covariance Type: nonrobust
=========================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Intercept 15.7224 0.111 141.879 0.000 15.505 15.940
income_poverty_ratio_estimate 0.0204 0.001 34.050 0.000 0.019 0.022
I(income_poverty_ratio_estimate ** 2) -1.367e-05 6.8e-07 -20.107 0.000 -1.5e-05 -1.23e-05
==============================================================================
Omnibus: 307.630 Durbin-Watson: 1.377
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1014.132
Skew: -0.041 Prob(JB): 6.08e-221
Kurtosis: 4.833 Cond. No. 7.84e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.84e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The summary shows that the R-squared is 0.313, only slightly higher than in the previous model, and the coefficient on the squared term is statistically significant. It is not yet clear how much this improves the model, but this analysis gives us a significant quadratic term as well as a significant linear term.
model_income_poverty_estimate2.pvalues
Intercept                                 0.000000e+00
income_poverty_ratio_estimate            8.124518e-236
I(income_poverty_ratio_estimate ** 2)     1.553277e-87
dtype: float64
We can use an analysis of variance (ANOVA) to compare these two nested polynomial regression models, testing whether the more complicated model is statistically significantly different from the simpler one.
sms.anova_lm(model_income_poverty_estimate, model_income_poverty_estimate2)
|  | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 7225.0 | 32964.244161 | 0.0 | NaN | NaN | NaN |
| 1 | 7224.0 | 31217.254238 | 1.0 | 1746.989923 | 404.271788 | 1.553277e-87 |
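The agreement between the ANOVA p-value and the squared term's p-value is not a coincidence: when the larger model adds exactly one parameter, the ANOVA F statistic equals the square of that coefficient's t statistic, so the two tests are the same test. A quick check on synthetic data (hypothetical toy frame, not the project data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.anova as sms

rng = np.random.default_rng(3)
toy = pd.DataFrame({'x': rng.uniform(0, 10, 300)})
toy['y'] = 2 + 0.5 * toy['x'] - 0.03 * toy['x'] ** 2 + rng.normal(size=300)

# Nested models differing by a single quadratic term.
m1 = smf.ols('y ~ x', data=toy).fit()
m2 = smf.ols('y ~ x + I(x**2)', data=toy).fit()

anova = sms.anova_lm(m1, m2)
t_sq = m2.tvalues['I(x ** 2)'] ** 2
print(np.isclose(anova['F'].iloc[1], t_sq))  # F equals t squared
```

With more than one added parameter the ANOVA F test and the individual t tests would no longer coincide, which is when the nested-model comparison earns its keep.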
The p-value is quite small, indicating a statistically significant difference between the models; it is in fact exactly the same as the p-value on the coefficient for the squared term. Let's look at the accuracy of the quadratic model using the mean absolute error.
mean_absolute_error(new_df['average_act'], model_income_poverty_estimate2.predict())
1.5685820578069967
mean_absolute_error(new_df['average_act'], model_income_poverty_estimate.predict())
1.619754091555486
The mean absolute error is 1.57, which is smaller than for the first model but not by a practically meaningful amount. So we can relate one socioeconomic predictor variable to the average ACT score and formulate a model with some predictive power, but the prediction is relatively weak. We have also seen that a linear model is probably sufficient, and that a quadratic is not necessarily going to provide a much better fit.
We will now fit a multiple linear regression model including all of the socioeconomic predictors plus the new variable, aiming for a more accurate prediction of the ACT score.
new_model = smf.ols(
formula='average_act ~ rate_unemployment + percent_college + percent_married + median_income + percent_lunch + income_poverty_ratio_estimate',
data=new_df).fit()
print(new_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.633
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 2079.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:37 Log-Likelihood: -13275.
No. Observations: 7227 AIC: 2.656e+04
Df Residuals: 7220 BIC: 2.661e+04
Df Model: 6
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Intercept 22.5955 0.137 164.995 0.000 22.327 22.864
rate_unemployment -2.3880 0.401 -5.949 0.000 -3.175 -1.601
percent_college 1.2906 0.162 7.951 0.000 0.972 1.609
percent_married -0.0446 0.133 -0.336 0.737 -0.305 0.216
median_income -7.874e-06 1.42e-06 -5.555 0.000 -1.07e-05 -5.1e-06
percent_lunch -7.3647 0.099 -74.462 0.000 -7.559 -7.171
income_poverty_ratio_estimate 0.0022 0.000 10.298 0.000 0.002 0.003
==============================================================================
Omnibus: 926.516 Durbin-Watson: 1.490
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3511.241
Skew: 0.608 Prob(JB): 0.00
Kurtosis: 6.191 Cond. No. 1.34e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
When income_poverty_ratio_estimate was added to the model, the coefficient for median_income became statistically significant. However, the change is likely due to multicollinearity among income-related variables rather than a new substantive effect. Both variables measure similar aspects of socioeconomic status, and the inflated condition number indicates instability in the estimates. To improve interpretability, one of these correlated predictors should be removed before drawing conclusions about income effects on ACT performance. We can use a residual plot for a graphical assessment of the model fit to see if we have used these predictor variables in the best way possible or if there's some transformation that we might apply to improve the fit.
new_y_hat = new_model.predict()
plt.figure(figsize=(6, 6))
plt.plot(new_y_hat, new_model.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
This plot shows relatively little structure in the residuals, indicating that we are unlikely to improve the model by transforming the input variables we have used. We can also assess the accuracy of the model numerically with the mean absolute error.
mean_absolute_error(new_df['average_act'], new_model.predict())
1.133779563385823
This is less than for the old model. Next we will fit a reduced model that includes only predictor variables with statistically significant coefficients and see whether that reduced model is sufficient to predict the average ACT score. Here we include the unemployment rate, percent college, the percent of students eligible for free or reduced price lunch, and the income poverty ratio estimate. We drop median_income because income_poverty_ratio_estimate describes a similar socioeconomic context: it reflects the income level of the neighborhood around the school, though not necessarily the actual income backgrounds of the enrolled students.
new_model_reduced = smf.ols(
formula='average_act ~ rate_unemployment + percent_college + percent_lunch + income_poverty_ratio_estimate',
data=new_df).fit()
print(new_model_reduced.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.631
Model: OLS Adj. R-squared: 0.631
Method: Least Squares F-statistic: 3094.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:37 Log-Likelihood: -13293.
No. Observations: 7227 AIC: 2.660e+04
Df Residuals: 7222 BIC: 2.663e+04
Df Model: 4
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Intercept 22.4856 0.104 216.900 0.000 22.282 22.689
rate_unemployment -1.9076 0.373 -5.109 0.000 -2.639 -1.176
percent_college 0.9820 0.154 6.386 0.000 0.681 1.283
percent_lunch -7.3362 0.097 -75.758 0.000 -7.526 -7.146
income_poverty_ratio_estimate 0.0015 0.000 8.347 0.000 0.001 0.002
==============================================================================
Omnibus: 906.945 Durbin-Watson: 1.485
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3296.536
Skew: 0.606 Prob(JB): 0.00
Kurtosis: 6.078 Cond. No. 7.07e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.07e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
All four of these variables have statistically significant coefficients. We can use a residual plot for a graphical assessment of this model fit, and we will see that it is very similar to the residual plot for the full model.
new_y_hat_reduced = new_model_reduced.predict()
plt.figure(figsize=(6, 6))
plt.plot(new_y_hat_reduced, new_model_reduced.resid, 'ko', mec='w')
plt.axhline(0, color='r', linestyle='dashed', lw=2)
plt.xlabel('Predicted average ACT score', fontsize=16)
plt.ylabel('Residual average ACT score', fontsize=16)
plt.tick_params(labelsize=14)
plt.show()
Again there is essentially no structure in the residuals, indicating that we do not need to include transformations of these predictors to improve the model. We can assess the accuracy numerically, again using the mean absolute error.
mean_absolute_error(new_df['average_act'], new_model_reduced.predict())
1.1383027484766712
This is slightly larger than for the full model. Comparing the mean absolute error and R-squared between the reduced model and the full model with all six predictors, the reduced model is slightly less accurate, though the difference is small.
new_mae_full = mean_absolute_error(new_df['average_act'], new_model.predict())
new_mae_reduced = mean_absolute_error(new_df['average_act'], new_model_reduced.predict())
new_r2_full = new_model.rsquared
new_r2_reduced = new_model_reduced.rsquared
pd.DataFrame({'New Mean Absolute Error': [new_mae_full, new_mae_reduced],
'New R-squared': [new_r2_full, new_r2_reduced]},
index=['new full model', 'new reduced model']).round(4)
|  | New Mean Absolute Error | New R-squared |
|---|---|---|
| new full model | 1.1338 | 0.6333 |
| new reduced model | 1.1383 | 0.6315 |
We can also consider the significance of the difference between the model using an ANOVA.
sms.anova_lm(new_model_reduced, new_model)
|  | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 7222.0 | 16754.972033 | 0.0 | NaN | NaN | NaN |
| 1 | 7220.0 | 16670.754967 | 2.0 | 84.217066 | 18.236943 | 1.258157e-08 |
The reduced model matches the full model closely in predictive accuracy (R² ≈ 0.63, MAE ≈ 1.14). The ANOVA does report a statistically significant difference (p ≈ 1.3e-08), but the practical loss of explanatory power from removing the redundant predictors is negligible, so simplifying the model improves interpretability without materially compromising performance. Replacing median_income with income_poverty_ratio_estimate is what distinguishes this reduced model from the earlier one. Now we will scale the predictors in the reduced model to have a mean of 0 and a standard deviation of 1, so that the magnitudes of the coefficients can be used to compare the relative importance of each predictor in estimating the average ACT score.
new_predictor_variables = ['rate_unemployment', 'percent_college', 'percent_lunch', 'income_poverty_ratio_estimate']
new_scaled_columns = [var + '_normalized' for var in new_predictor_variables]
print(new_scaled_columns)
['rate_unemployment_normalized', 'percent_college_normalized', 'percent_lunch_normalized', 'income_poverty_ratio_estimate_normalized']
new_scaler = StandardScaler().fit(new_df[new_predictor_variables])
new_df[new_scaled_columns] = new_scaler.transform(new_df[new_predictor_variables])
Now let's check and see that these new variables do in fact have a mean of 0 and a standard deviation of 1. We might expect there to be some small differences just due to calculation error, and they do have means of essentially zero and standard deviations of one.
new_df[new_scaled_columns].agg(['mean', 'std']).round(3)
|  | rate_unemployment_normalized | percent_college_normalized | percent_lunch_normalized | income_poverty_ratio_estimate_normalized |
|---|---|---|---|---|
| mean | 0.0 | -0.0 | -0.0 | -0.0 |
| std | 1.0 | 1.0 | 1.0 | 1.0 |
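One subtlety behind the "essentially zero" means: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' `.std()` defaults to the sample version (ddof=1), so a manual z-score must use ddof=0 to reproduce the scaler's output. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
s = pd.Series(rng.normal(loc=5, scale=2, size=100))

scaled = StandardScaler().fit_transform(s.to_frame()).ravel()  # uses ddof=0
manual = (s - s.mean()) / s.std(ddof=0)                        # matches the scaler
print(np.allclose(scaled, manual))
```

The discrepancy with the default `.std(ddof=1)` is only a factor of sqrt((n-1)/n), which is why the check above still prints standard deviations of 1.0 after rounding.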
Now we can fit the multiple linear regression model with the normalized predictors, again using the four variables from the reduced model, all of which have statistically significant coefficients.
new_model_normalize = smf.ols(
formula='average_act ~ rate_unemployment_normalized + percent_college_normalized + percent_lunch_normalized + income_poverty_ratio_estimate_normalized',
data=new_df).fit()
print(new_model_normalize.summary())
OLS Regression Results
==============================================================================
Dep. Variable: average_act R-squared: 0.631
Model: OLS Adj. R-squared: 0.631
Method: Least Squares F-statistic: 3094.
Date: Sun, 19 Oct 2025 Prob (F-statistic): 0.00
Time: 15:10:37 Log-Likelihood: -13293.
No. Observations: 7227 AIC: 2.660e+04
Df Residuals: 7222 BIC: 2.663e+04
Df Model: 4
Covariance Type: nonrobust
============================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
Intercept 20.2986 0.018 1132.926 0.000 20.263 20.334
rate_unemployment_normalized -0.1079 0.021 -5.109 0.000 -0.149 -0.067
percent_college_normalized 0.1619 0.025 6.386 0.000 0.112 0.212
percent_lunch_normalized -1.7184 0.023 -75.758 0.000 -1.763 -1.674
income_poverty_ratio_estimate_normalized 0.2222 0.027 8.347 0.000 0.170 0.274
==============================================================================
Omnibus: 906.945 Durbin-Watson: 1.485
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3296.536
Skew: 0.606 Prob(JB): 0.00
Kurtosis: 6.078 Cond. No. 2.92
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We still see a much larger magnitude of coefficient for the percent lunch variable than for any of the other three variables, meaning the estimated change in the average ACT score for a one standard deviation change in percent lunch is much larger than for the other predictors. Note that the R-squared value is the same as for the reduced model. Let's compare the accuracy of the original and normalized models.
new_mae_normalized = mean_absolute_error(new_df['average_act'], new_model_normalize.predict())
new_mae_reduced = mean_absolute_error(new_df['average_act'], new_model_reduced.predict())
new_r2_normalized = new_model_normalize.rsquared
new_r2_reduced = new_model_reduced.rsquared
pd.DataFrame({'New Mean Absolute Error': [new_mae_normalized, new_mae_reduced],
'New R-squared': [new_r2_normalized, new_r2_reduced]},
index=['new normalized model', 'new reduced model']).round(4)
|  | New Mean Absolute Error | New R-squared |
|---|---|---|
| new normalized model | 1.1383 | 0.6315 |
| new reduced model | 1.1383 | 0.6315 |
In fact, these numerical measures, the mean absolute error and the R-squared, are exactly the same for the model with unscaled predictors and the model with scaled predictors. This is because standardizing the predictors does not change the overall linear regression model; it only rescales the coefficients.